Lineage-informative microhaplotypes for recurrence classification and spatio-temporal surveillance of Plasmodium vivax malaria parasites

Challenges in classifying recurrent Plasmodium vivax infections constrain surveillance of antimalarial efficacy and transmission. Recurrent infections may arise from activation of dormant liver stages (relapse), blood-stage treatment failure (recrudescence) or reinfection. Molecular inference of familial relatedness (identity-by-descent or IBD) can help resolve the probable origin of recurrences. As whole genome sequencing of P. vivax remains challenging, targeted genotyping methods are needed for scalability. We describe a P. vivax marker discovery framework to identify and select panels of microhaplotypes (multi-allelic markers within small, amplifiable segments of the genome) that can accurately capture IBD. We evaluate panels of 50–250 microhaplotypes discovered in a global set of 615 P. vivax genomes. A candidate global 100-microhaplotype panel exhibits high marker diversity in the Asia-Pacific, Latin America and horn of Africa (median HE = 0.70–0.81) and identifies 89% of the polyclonal infections detected with genome-wide datasets. Data simulations reveal lower error in estimating pairwise IBD using microhaplotypes relative to traditional biallelic SNP barcodes. The candidate global panel also exhibits high accuracy in predicting geographic origin and captures local infection outbreak and bottlenecking events. Our framework is open-source enabling customised microhaplotype discovery and selection, with potential for porting to other species or data resources.

Supplementary Figure 1.P. vivax incidence map illustrating regional country groupings.
The baseline P. vivax incidence map was derived from the Malaria Atlas Project (MAP) and presents the number of newly diagnosed P. vivax cases per 1,000 population in 2020 1 .The labelled, dashed black lines indicate the boundaries of the geographic regions included in the analyses: SAM (South America), AF (Africa), WAS (West Asia), WSEA (West Southeast Asia), ESEA (East Southeast Asia), MSEA (Maritime Southeast Asia) and OCE (Oceania).Pins indicate the countries included in each regional grouping, with red pins indicating countries that were represented with samples in the marker discovery dataset (from Pv4), blue pins for those that were part of the external validation dataset (non-Pv4), and purple pins for countries that were present in both datasets.Most of Africa does not have endemic P. vivax due to widespread duffy negativity and is generally geographically restricted to the Horn of Africa.The data represented in this study captures the majority of countries where P. vivax is prevalent.

Supplementary Figure 3. Correlations between microhaplotype and genomic estimates of IBD in Pv4 dataset by region.
Panels a) to g) represent Africa (AF), East Southeast Asia (ESEA), Maritime Southeast Asia (MSEA), Oceania (OCE), South America (SAM), West Asia (WAS), and West Southeast Asia (WSEA).The microhaplotype and whole genome sequence (WGS) IBD estimates reflect pairwise estimates at the High-diversity SNP microhaplotype panel and a set of 898,448 genome-wide SNPs.All calculations were performed using hmmIBD on the 615 monoclonal sample set.Correlations were assessed with Spearman's rho statistic (using a paired test) and presented with the associated p-value.At an alpha of 0.05, significantly positive correlations were observed in all regions.Comparisons were undertaken between the 494 SNPs in the High-diversity 100 microhaplotype panel (MHAP-3_10), the 38-SNP Broad barcode (BR38), and the 33-, 50-and 55-SNP GEO panels (GEO33, GEO50 and GEO55 respectively).The median Matthews correlation coefficient (MCC) summary statistics are based on 500 repeats of the stratified 10-fold cross-validation using the Bi-Allele Likelihood (BALK) classifier in a set of n = 799 biologically independent samples from 21 countries (each with n ≥4) 2 .

Region % Isolates
a calibrated BAM file of each sample.GATK4.3HaplotypeCaller was used to perform variant calling on the calibrated BAM files to yield GVCF files, and finally joint variant calling utilizing GATK4.3GenotypeGVCFs was conducted on those GVCF files to obtain raw merged VCF data.
The raw VCF data was annotated for minimum depth of 5 to successfully call a genotype (<5 reads defined as a genotype failure/missing genotype), and a minimum depth of 2 reference and 2 alternate alleles to define a genotype as heterozygote; these annotations were conducted using an in-house Python script contained within the vivaxGEN NGS pipeline.The annotated VCF data were then filtered for the 1,303,984 known variants in Pv4 set using bcftools view command.

Neighbor-joining tree construction and FWS calculation
The 827 non-Pv4 samples with <50% genotype failures across the selected microhaplotype SNPs (494 biallelic SNPs within the exemplar 100-microhaplotype high diversity panel) were selected for neighbor-joining analysis, resulting in 324 samples from 19 countries as outlined below: The 324 filtered samples in the independent validation dataset were combined with 1,031 Pv4 high-quality samples (samples that were annotated as Analysis_set in the Pv4 metadata; https://www.malariagen.net/data_package/open-dataset-plasmodium-vivax-v4-0/).The combined set was then filtered for variants with sample missingness <0.05 (5%), filtered for variants with minor allele count (MAC)=2, filtered for samples with variant missingness <0.5 (50%), and re-filtered for variants with MAC=2 again, consecutively.The final filtered dataset comprised 329,371 biallelic variants and 1,354 samples (of which 324 were independent validation samples and 1,030 Pv4 samples).All filtering was performed using bcftools view command.
A distance matrix based on major allele proportional genetic distance was calculated from the filtered dataset (n=1,354 samples), and a neighbor-joining tree was generated and plotted using the R software ape package 13 .The branches in the neighbour-joining tree were then coloured according to (i) country of origin for both Pv4 and independent validation samples, (ii) country of origin for independent validation samples only, and (ii) regional classification as per the Pv4 genetic clustering patterns 3 .The FWS metric, which provides a measure of within-host inbreeding that can be used to gauge whether an infection is likely polyclonal or monoclonal, was calculated on the same dataset using the R software moimix package (https://github.com/bahlolab/moimix).As per the Pv4 polyclonal classifications 3,14 , an FWS threshold of <0.95 was used to classify infections as likely being polyclonal.

Identity-by-descent estimation
Correlation between microhaplotype-based identity-by-descent (IBD) and whole genome-based IBD was calculated on the filtered independent validation samples belonging to the four regional groups (AF, ESEA, SAM and WAS) that comprised >30 samples, for a total of 288 samples as detailed below.*Note that samples from Panama were excluded from the SAM region owing to a clonal expansion that could bias the IBD evaluation 4 .

Region
Using the 494 biallelic SNPs within the 100-microhaplotype high diversity panel, IBD was estimated using hmmIBD version 3 15 , with additional parameters min_snp_sep =0 and max_all=100.Genome-wide estimates of IBD were conducted using hmmIBD with the default values.The correlation between the microhaplotype and genome-wide IBD estimations was illustrated with scatter plots comprising colour-coding by pairwise-missing data/genotype failures at the 494 microhaplotype-derived SNPs (i.e.missingness when one or both samples in a pair had a genotype failure/missing call).Scatter plots were generated using an in-house Python script.The magnitude and significance of the correlations was assessed using Spearman's rho statistic (using a paired test).

Supplementary Table 1. Regional patterns of within-host infection diversity.
Fws in the COI=1 infection group is not larger than in the COI>1 infection group.
Measures were conducted on n=922 high-quality biologically independent samples from Pv4. *Pearson's Chi-squared test with Yates' continuity correction.**1-sided Mann-Whitney U testing the hypothesis that the